Communication-Efficient SGD
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to SGD's excellent scalability properties. A fundamental barrier when parallelizing SGD is the high bandwidth cost of communicating gradient updates between nodes; consequently, several lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice, these heuristics do not always guarantee convergence, and it is not clear whether they can be improved. In this paper, we propose Quantized SGD (QSGD), a family of compression schemes for gradient updates which provides convergence guarantees. QSGD allows the user to smoothly trade off \emph{communication bandwidth} and \emph{convergence time}: nodes can adjust the number of bits sent per iteration, at the cost of possibly higher variance. We show that this trade-off is inherent, in the sense that improving it past some threshold would violate information-theoretic lower bounds. QSGD guarantees convergence for convex and non-convex objectives, under asynchrony, and can be extended to stochastic variance-reduced techniques. When applied to training deep neural networks for image classification and automated speech recognition, QSGD leads to significant reductions in end-to-end training time. For example, on 16 GPUs, we can train the ResNet-152 network to full accuracy on ImageNet 1.8x faster than the full-precision variant.
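To make the bandwidth/variance trade-off concrete, the sketch below shows the kind of stochastic, unbiased quantizer QSGD is built around: each coordinate is rounded to one of $s$ uniform levels of its magnitude relative to the gradient's norm, with probabilities chosen so the quantized gradient is unbiased in expectation. The function name and the use of NumPy are illustrative, and the lossless (Elias-style) encoding of the quantized values is omitted.

```python
import numpy as np

def qsgd_quantize(v, s, rng=None):
    """Unbiased stochastic quantization of a gradient vector to s levels.

    Each coordinate is mapped to sign(v_i) * ||v||_2 * (l / s), where l is
    randomly chosen between the two integer levels bracketing
    s * |v_i| / ||v||_2, with probabilities that keep the quantizer unbiased.
    The lossless (Elias-style) encoding of the output is omitted here.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * s            # each entry lies in [0, s]
    lower = np.floor(scaled)                 # lower bracketing level
    prob_up = scaled - lower                 # probability of rounding up
    levels = lower + (rng.random(v.shape) < prob_up)
    return np.sign(v) * norm * levels / s
```

Fewer levels (small $s$) mean fewer bits per coordinate but a higher-variance gradient estimate, which is exactly the trade-off the abstract describes.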
Communication-efficient SGD: From Local SGD to One-Shot Averaging
We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers. We assume the same data set is shared among N workers, who can take SGD steps and coordinate with a central server. While it is possible to obtain a linear reduction in the variance by averaging all the stochastic gradients at every step, this requires a lot of communication between the workers and the server, which can dramatically reduce the gains from parallelism. The Local SGD method, proposed and analyzed in the earlier literature, suggests machines should make many local steps between such communications. While the initial analysis of Local SGD showed it needs $\Omega(\sqrt{T})$ communications for $T$ local gradient steps in order for the error to scale proportionately to $1/(NT)$, this has been successively improved in a string of papers, with the state of the art requiring $\Omega\left(N \, \mathrm{poly}(\log T)\right)$ communications. In this paper, we suggest a Local SGD scheme that communicates less overall by communicating less frequently as the number of iterations grows.
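The sketch below illustrates the basic Local SGD loop described above, assuming a fixed number of local steps between averagings; the scheme proposed in the paper additionally lets the interval between communications grow with the iteration count. All names (local_sgd, grad_fn) are illustrative, not the authors' code.

```python
import numpy as np

def local_sgd(grad_fn, x0, n_workers, n_rounds, local_steps, lr, rng):
    """Minimal Local SGD loop with a fixed communication interval.

    Each worker keeps its own copy of the parameters, takes `local_steps`
    stochastic gradient steps, and then all copies are averaged (the single
    communication per round).  grad_fn(x, rng) returns a stochastic gradient.
    """
    workers = [x0.copy() for _ in range(n_workers)]
    for _ in range(n_rounds):
        for w in range(n_workers):
            for _ in range(local_steps):
                workers[w] = workers[w] - lr * grad_fn(workers[w], rng)
        averaged = np.mean(workers, axis=0)      # communication / averaging step
        workers = [averaged.copy() for _ in range(n_workers)]
    return workers[0]
```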
CSER: Communication-efficient SGD with Error Reset
The scalability of Distributed Stochastic Gradient Descent (SGD) is today limited by communication bottlenecks. CSER is built on two key ideas: first, a new technique called ``error reset'' that adapts arbitrary compressors for SGD, producing bifurcated local models with a periodic reset of the resulting local residual errors; second, partial synchronization of both the gradients and the models, which leverages the advantages of each. We prove the convergence of CSER for smooth non-convex problems. Empirical results show that, when combined with highly aggressive compressors, the CSER algorithms accelerate distributed training by nearly $10\times$ for CIFAR-100 and by $4.5\times$ for ImageNet.
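As a rough illustration of how a compressor can be paired with a locally carried residual, here is a hedged sketch: the worker compresses the sum of its step and the residual, transmits the compressed part, and keeps the leftover locally; in CSER this residual is periodically reset to zero, tied to a partial synchronization of the local models. This is only a generic sketch of the error-compensation idea, not CSER's exact update rule, and top_k is just one example of an aggressive compressor.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of a 1-D vector, zero the rest
    (one example of an aggressive compressor)."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def compressed_step_with_residual(grad, residual, lr, k):
    """One worker-side step of compression with a carried local residual.

    The worker compresses (lr * grad + residual), transmits the compressed
    part, and keeps the leftover as the new residual.  In CSER the residual
    is periodically reset to zero (the "error reset"), which is coupled to a
    partial synchronization of the local models; that outer loop is not shown.
    """
    corrected = lr * grad + residual
    sent = top_k(corrected, k)        # what gets communicated
    new_residual = corrected - sent   # compression error kept locally
    return sent, new_residual
```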
Reviews: QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
Update: I slightly decrease the grade due to the mismatch between the theoretical and practical results, which could be better covered. Still, this paper has strong experimental results and some theoretical results. I would encourage the authors to narrow the gap between the two. In this paper the authors introduce Quantized SGD (QSGD), a scheme for reducing the communication cost of SGD when performing distributed optimization. The quantization scheme is useful as soon as one has to transmit gradients between different machines.
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
Alistarh, Dan, Grubic, Demjan, Li, Jerry, Tomioka, Ryota, Vojnovic, Milan
Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to SGD's excellent scalability properties. A fundamental barrier when parallelizing SGD is the high bandwidth cost of communicating gradient updates between nodes; consequently, several lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice, these heuristics do not always guarantee convergence, and it is not clear whether they can be improved. In this paper, we propose Quantized SGD (QSGD), a family of compression schemes for gradient updates which provides convergence guarantees. QSGD allows the user to smoothly trade off \emph{communication bandwidth} and \emph{convergence time}: nodes can adjust the number of bits sent per iteration, at the cost of possibly higher variance.
Hyper-Sphere Quantization: Communication-Efficient SGD for Federated Learning
Dai, Xinyan, Yan, Xiao, Zhou, Kaiwen, Yang, Han, Ng, Kelvin K. W., Cheng, James, Fan, Yu
The high cost of communicating gradients is a major bottleneck for federated learning, as the bandwidth of the participating user devices is limited. Existing gradient compression algorithms are mainly designed for data centers with high-speed networks and achieve $O(\sqrt{d} \log d)$ per-iteration communication cost at best, where $d$ is the size of the model. We propose hyper-sphere quantization (HSQ), a general framework that can be configured to achieve a continuum of trade-offs between communication efficiency and gradient accuracy. In particular, at the high-compression end, HSQ provides a low per-iteration communication cost of $O(\log d)$, which is favorable for federated learning. We prove the convergence of HSQ theoretically and show by experiments that HSQ significantly reduces the communication cost of model training without hurting convergence accuracy.
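The sketch below illustrates the vector-quantization style of scheme the abstract describes, in which only a codeword index and a scale are transmitted, giving a per-iteration payload logarithmic in the codebook size rather than linear in $d$. The random codebook, function names, and decoding step are hypothetical simplifications for illustration only, not HSQ's actual construction.

```python
import numpy as np

def make_codebook(d, m, seed=0):
    """A hypothetical shared codebook of m unit vectors in R^d.

    Random vectors are used purely for illustration; the actual HSQ
    codebook construction is specified in the paper.
    """
    rng = np.random.default_rng(seed)
    c = rng.standard_normal((m, d))
    return c / np.linalg.norm(c, axis=1, keepdims=True)

def encode(v, codebook):
    """Transmit only the index of the codeword most aligned with the gradient,
    plus the gradient's norm: O(log m) bits and one scalar instead of d floats."""
    idx = int(np.argmax(codebook @ v))
    return idx, float(np.linalg.norm(v))

def decode(idx, norm, codebook):
    """Reconstruct an approximate gradient from the index and the norm."""
    return norm * codebook[idx]
```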